Welcome back to deep learning! Today we want to talk about the last part of object detection and segmentation and look into the concept of instance segmentation. So let's have a look at our slides. You see, this is already the last part, part five, and now we want to talk about instance segmentation. Here, we not only want to detect where the pixels of cubes or cups are, we really want to figure out which pixels belong to which cube. So this is essentially a combination of object detection and semantic segmentation. Examples for potential applications are information about occlusion, counting the number of elements belonging to the same class, and detecting object boundaries, which is very important, for example, for gripping objects in robotics. Examples in the literature are Simultaneous Detection and Segmentation [9], DeepMask [18], SharpMask [19], and Mask R-CNN [10].
So let's look at reference [10], Mask R-CNN, in a little more detail. Here we essentially go back to the start and combine object detection and segmentation: we use R-CNN for the object detection, the object detection essentially solves the instance separation, and the segmentation then refines the bounding boxes per instance. The workflow is a two-stage procedure: you have the region proposal stage that proposes the object bounding boxes, and then you have the classification with a bounding box regression and the segmentation in parallel. This means you have a multitask loss that combines the pixel-wise classification loss, i.e. the segmentation loss, with the box loss and the class loss for producing the right class per bounding box. So you have these three terms that are then combined into one multitask loss.
Let's look at the two-stage procedure in some more detail. You have two different options for two-stage networks: you can have a joint branch that works on the ROIs and splits at a later stage into the segmentation of the mask on the one hand and the class and bounding box prediction on the other, or you can split early and run separate networks. In both versions you have this multitask loss that combines the pixel-wise segmentation loss, the box loss, and the class loss.
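To make this multitask loss a bit more concrete, here is a minimal PyTorch-style sketch of how the three terms could be combined for a batch of ROIs. The tensor names, shapes, and the plain unweighted sum are illustrative assumptions, not the exact Mask R-CNN implementation.

```python
import torch.nn.functional as F

def multitask_loss(class_logits, class_targets,
                   box_deltas, box_targets,
                   mask_logits, mask_targets):
    """Sketch of a Mask R-CNN style multitask loss (hypothetical shapes).

    class_logits:  (N, C)    class scores per ROI
    class_targets: (N,)      ground-truth class index per ROI
    box_deltas:    (N, 4)    predicted box regression offsets
    box_targets:   (N, 4)    ground-truth regression targets
    mask_logits:   (N, H, W) mask logits, already selected for the GT class
    mask_targets:  (N, H, W) binary ground-truth masks (float)
    """
    # class loss: produce the right class per bounding box
    loss_cls = F.cross_entropy(class_logits, class_targets)
    # box loss: smooth L1 on the regression offsets, as in Fast/Faster R-CNN
    loss_box = F.smooth_l1_loss(box_deltas, box_targets)
    # segmentation loss: pixel-wise binary cross-entropy on the instance mask
    loss_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
    # the three terms are simply summed into one multitask loss
    return loss_cls + loss_box + loss_mask
```

In the actual Mask R-CNN [10], the box and mask terms are only evaluated for positive ROIs, and the mask head predicts one mask per class, of which only the one for the ground-truth class contributes to the loss.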
Let's have a look at some examples. These are results from Mask R-CNN, and you can see that, to be honest, these are quite impressive: even in really difficult cases, you identify where the persons are and also show that the different persons are, of course, different instances. So, very impressive results.
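If you want to play with such instance segmentation results yourself, torchvision ships a Mask R-CNN pretrained on COCO. The sketch below is a minimal inference example under a few assumptions: it requires a reasonably recent torchvision version (for the weights argument), "example.jpg" is just a placeholder path, and the 0.5 score and mask thresholds are chosen for illustration only.

```python
import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

# Mask R-CNN with a ResNet-50 FPN backbone, pretrained on COCO
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# load an image and scale it to float values in [0, 1]
img = convert_image_dtype(read_image("example.jpg"), torch.float)

with torch.no_grad():
    # the model takes a list of images and returns one dict per image
    out = model([img])[0]

# keep only confident detections; each one has a box, a label, and a mask
keep = out["scores"] > 0.5
boxes = out["boxes"][keep]        # (K, 4) bounding boxes
labels = out["labels"][keep]      # (K,)   COCO class indices
masks = out["masks"][keep] > 0.5  # (K, 1, H, W) binary instance masks
print(f"found {int(keep.sum())} instances")
```

Note how each detected instance carries a class, a box, and a pixel-wise mask, which mirrors exactly the three heads combined in the multitask loss above.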
Let's summarize what we've seen so far. Segmentation is commonly solved by architectures that analyze the image and subsequently refine the coarse results. Fully convolutional networks preserve the spatial layout and, with pooling, enable arbitrary input sizes. We can use object detectors and implement them as a sequence of region proposals and classification; this essentially leads to the family of R-CNN-type networks. Alternatively, you can go to single-shot detectors: we looked at YOLO [26] and YOLO9000 [27], which are very common and very fast techniques, and we looked into RetinaNet [7] for cases where you really have a scale dependency and want to detect on many different scales, as in the example of histological slide processing. So object detection and segmentation are closely related, and combinations are common, as you have seen here for the purpose of instance segmentation.
Let's look at what we still have to talk about in this lecture. Coming up very soon are methods to relieve the burden of labeling. We will talk about weak annotations and how we can generate labels, which then also leads to the concept of self-supervision, a very popular topic right now that has been used very heavily to generate better networks.
Deep Learning - Segmentation and Object Detection Part 5
In this video, we look at instance segmentation and introduce the concepts of Mask R-CNN.
Additional References
nnU-Net: Self-adapting Framework for U-Net-Based Medical Image Segmentation
X-ray-transform Invariant Anatomical Landmark Detection for Pelvic Trauma Surgery
RetinaNet Figure by Marc Aubreville
DarkNet Library
Joseph Redmon CV
Further Reading:
A gentle Introduction to Deep Learning
References
[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. “Segnet: A deep convolutional encoder-decoder architecture for image segmentation”. In: arXiv preprint arXiv:1511.00561 (2015).
[2] Xiao Bian, Ser Nam Lim, and Ning Zhou. “Multiscale fully convolutional network with application to industrial inspection”. In: Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on. IEEE. 2016, pp. 1–8.
[3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, et al. “Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs”. In: CoRR abs/1412.7062 (2014). arXiv: 1412.7062.
[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, et al. “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs”. In: arXiv preprint arXiv:1606.00915 (2016).
[5] S. Ren, K. He, R. Girshick, et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 39.6 (June 2017), pp. 1137–1149.
[6] R. Girshick. “Fast R-CNN”. In: 2015 IEEE International Conference on Computer Vision (ICCV). Dec. 2015, pp. 1440–1448.
[7] Tsung-Yi Lin, Priya Goyal, Ross Girshick, et al. “Focal loss for dense object detection”. In: arXiv preprint arXiv:1708.02002 (2017).
[8] Alberto Garcia-Garcia, Sergio Orts-Escolano, Sergiu Oprea, et al. “A Review on Deep Learning Techniques Applied to Semantic Segmentation”. In: arXiv preprint arXiv:1704.06857 (2017).
[9] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, et al. “Simultaneous detection and segmentation”. In: European Conference on Computer Vision. Springer. 2014, pp. 297–312.
[10] Kaiming He, Georgia Gkioxari, Piotr Dollár, et al. “Mask R-CNN”. In: CoRR abs/1703.06870 (2017). arXiv: 1703.06870.
[11] N. Dalal and B. Triggs. “Histograms of oriented gradients for human detection”. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Vol. 1. June 2005, 886–893 vol. 1.
[12] Jonathan Huang, Vivek Rathod, Chen Sun, et al. “Speed/accuracy trade-offs for modern convolutional object detectors”. In: CoRR abs/1611.10012 (2016). arXiv: 1611.10012.
[13] Jonathan Long, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic segmentation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 3431–3440.
[14] Pauline Luc, Camille Couprie, Soumith Chintala, et al. “Semantic segmentation using adversarial networks”. In: arXiv preprint arXiv:1611.08408 (2016).
[15] Christian Szegedy, Scott E. Reed, Dumitru Erhan, et al. “Scalable, High-Quality Object Detection”. In: CoRR abs/1412.1441 (2014). arXiv: 1412.1441.
[16] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. “Learning deconvolution network for semantic segmentation”. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1520–1528.
[17] Adam Paszke, Abhishek Chaurasia, Sangpil Kim, et al. “Enet: A deep neural network architecture for real-time semantic segmentation”. In: arXiv preprint arXiv:1606.02147 (2016).
[18] Pedro O Pinheiro, Ronan Collobert, and Piotr Dollár. “Learning to segment object candidates”. In: Advances in Neural Information Processing Systems. 2015, pp. 1990–1998.
[19] Pedro O Pinheiro, Tsung-Yi Lin, Ronan Collobert, et al. “Learning to refine object segments”. In: European Conference on Computer Vision. Springer. 2016, pp. 75–91.
[20] Ross B. Girshick, Jeff Donahue, Trevor Darrell, et al. “Rich feature hierarchies for accurate object detection and semantic segmentation”. In: CoRR abs/1311.2524 (2013). arXiv: 1311.2524.
[21] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for biomedical image segmentation”. In: MICCAI. Springer. 2015, pp. 234–241.
[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, et al. “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition”. In: Computer Vision – ECCV 2014. Cham: Springer International Publishing, 2014, pp. 346–361.
[23] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, et al. “Selective Search for Object Recognition”. In: International Journal of Computer Vision 104.2 (Sept. 2013), pp. 154–171.
[24] Wei Liu, Dragomir Anguelov, Dumitru Erhan, et al. “SSD: Single Shot MultiBox Detector”. In: Computer Vision – ECCV 2016. Cham: Springer International Publishing, 2016, pp. 21–37.
[25] P. Viola and M. Jones. “Rapid object detection using a boosted cascade of simple features”. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) Vol. 1. 2001, pp. 511–518.
[26] J. Redmon, S. Divvala, R. Girshick, et al. “You Only Look Once: Unified, Real-Time Object Detection”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2016, pp. 779–788.
[27] Joseph Redmon and Ali Farhadi. “YOLO9000: Better, Faster, Stronger”. In: CoRR abs/1612.08242 (2016). arXiv: 1612.08242.
[28] Fisher Yu and Vladlen Koltun. “Multi-scale context aggregation by dilated convolutions”. In: arXiv preprint arXiv:1511.07122 (2015).
[29] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, et al. “Conditional Random Fields as Recurrent Neural Networks”. In: CoRR abs/1502.03240 (2015). arXiv: 1502.03240.
[30] Alejandro Newell, Kaiyu Yang, and Jia Deng. “Stacked hourglass networks for human pose estimation”. In: European conference on computer vision. Springer. 2016, pp. 483–499.